DICOM Processing Script

Summary: Take DICOM images and process them for use in neon models.

NOTES:

  • DICOM stands for "Digital Imaging and Communications in Medicine". It is the standard format for essentially all medical imaging, from MRI to CT to ultrasound.
  • The standard was created so that device manufacturers would have a common format for hospitals to integrate into their digital imaging systems. That's good for us because it gives us a standard way to load and process many types of medical images. If we can get a good pipeline going, then this script might be useful for any medical imaging study.
  • MRI and CT are stored as 2D slices which can be combined to form a 3D volume. So for any given patient we may have several hundred slice files that correspond to the same scan. This is different from most of our image data models because we need to incorporate all of these slices into one tensor. We definitely want to use the 3D volume because things like tumors and lesions (and other bad things that cause illness) don't limit themselves to a single 2D slice. Therefore, we've got to load this into our model as a 4D tensor (channel, height, width, depth).
  • The slice thickness varies but is typically 1 to 3 mm. It depends on the type of study (MR/CT) and the parameters set at the start of the scan. Be sure to get a good handle on the height, width, and depth parameters of the DICOM files so that you are importing consistent tensors into the data loader. This is something I need to look into because ideally we'd like to standardize the tensors (for example, resample everything to 1 mm x 1 mm x 1 mm voxels; see the resampling sketch after this list).
  • The pixels are usually stored at uint16 precision (although I'm seeing several files that are 32-bit too). I'm not sure if we need to convert that to something with less precision. If we can keep the full 16-bit precision, then that would be preferable. There may in fact be anomalies that involve a very minor difference in contrast between two adjacent regions. This is an open question for analysis.
  • Along with the actual pixel information, the DICOM file also contains lots of metadata, such as slice thickness, pixel resolution, image orientation, patient orientation, and the type of study (e.g. MRI, CT, X-ray).
  • This assumes that your DICOM images are stored in the directory DATA_DIR. Within DATA_DIR there should be a separate folder for each patient's scans. The files usually end in the .dcm extension (e.g. "0a291d1b12b86213d813e3796f14b329.dcm" might be one slice for one patient). The SimpleITK library we use will load slices within a given patient's directory all at once into a 3D object.
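
If we do decide to standardize the voxel size, here's a minimal sketch of how that could be done with SimpleITK's ResampleImageFilter (imported as sitk further down). This isn't part of the pipeline yet; the function name and the 1 mm default spacing are just placeholders for illustration.

In [ ]:
def resample_to_isotropic(img, newSpacing=(1.0, 1.0, 1.0)):
    """Sketch: resample a SimpleITK 3D image to (approximately) isotropic voxels."""
    oldSpacing = img.GetSpacing()  # Real-world voxel size (mm) of the loaded series
    oldSize = img.GetSize()        # Number of voxels along each axis (x, y, z)

    # Keep the physical extent the same: new size = old size * old spacing / new spacing
    newSize = [int(round(osz * ospc / nspc))
               for osz, ospc, nspc in zip(oldSize, oldSpacing, newSpacing)]

    resampler = sitk.ResampleImageFilter()
    resampler.SetOutputSpacing(newSpacing)
    resampler.SetSize(newSize)
    resampler.SetOutputOrigin(img.GetOrigin())
    resampler.SetOutputDirection(img.GetDirection())
    resampler.SetInterpolator(sitk.sitkLinear)  # Linear is a reasonable default for intensity images
    return resampler.Execute(img)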

In [1]:
from matplotlib import pyplot as plt, cm
import numpy as np
import glob, os

%matplotlib inline

SimpleITK for reading DICOM

SimpleITK is an open-source library for reading and processing 2D and 3D images. It was particularly designed for medical imaging and is built on top of the Insight Segmentation and Registration Toolkit (ITK), sponsored by the National Library of Medicine.

What's nice about SimpleITK is that it has pre-built methods to read all of the DICOM slices into a 3D object and perform segmentation and morphological operations on that 3D object. I believe it automatically arranges the slices in the correct order. It can also handle compressed DICOM files.


In [2]:
#!pip install SimpleITK
import SimpleITK as sitk

Ask the system how many patients are in the data directory.

Typically the DICOM files for each patient are stored in a different sub-directory.

So the sub-folders (e.g. "00cba091fa4ad62cc3200a657aeb957e") represent different patients, and the *.dcm files within each folder contain the slices for one study (3D image) for that patient.

You'll have to tweak the code if your data is stored differently.


In [3]:
DATA_DIR = "/Volumes/data/tonyr/dicom/Lung CT/stage1"
    
patients = glob.glob(os.path.join(DATA_DIR, '*')) # Get the folder names for the patients

if len(patients) == 0:
    raise IOError('Directory ' + DATA_DIR + ' not found or no files found in directory')
    
print('Found the following subfolders with DICOMs: {}'.format(patients))


Found the following subfolders with DICOMs: ['/Volumes/data/tonyr/dicom/Lung CT/stage1/0030a160d58723ff36d73f41b170ec21', '/Volumes/data/tonyr/dicom/Lung CT/stage1/0092c13f9e00a3717fdc940641f00015', '/Volumes/data/tonyr/dicom/Lung CT/stage1/00cba091fa4ad62cc3200a657aeb957e', '/Volumes/data/tonyr/dicom/Lung CT/stage1/00edff4f51a893d80dae2d42a7f45ad1', '/Volumes/data/tonyr/dicom/Lung CT/stage1/013395589c01aa01f8df81d80fb0e2b8']

Set up a pipeline for reading all DICOM images for one patient at a time

SimpleITK has a pipeline for reading all of the DICOM slices for one patient into one object. So we'll essentially have a 3D array with the pixel values from every slice for this patient.

First let's just do a sanity check with one patient. Then, we'll re-use this code to grab the entire dataset into one large HDF5 file.


In [4]:
for patientDirectory in patients[:1]:  # TODO: Go through just one patient for testing. Later we'll use the whole loop
    
    # The next 4 lines set up a SimpleITK DICOM reader pipeline.
    reader = sitk.ImageSeriesReader()
    
    # Now get the names of the DICOM files within the directory
    filenamesDICOM = reader.GetGDCMSeriesFileNames(patientDirectory)
    reader.SetFileNames(filenamesDICOM)
    # Now execute the reader pipeline
    patientObject = reader.Execute()

patientObject description

Now patientObject is a SimpleITK object which includes the 3D array of the pixels for this patient's DICOM. Let's print out some of the meta values to get a sense of the DICOM information.


In [5]:
print('This is a {} dimensional image'.format(patientObject.GetDimension()))
print('There are {} slices for this patient.'.format(patientObject.GetDepth()))
print('Each slice is {} H x {} W pixels.'.format(patientObject.GetHeight(), patientObject.GetWidth()))
print('Color depth of the image is {}.'.format(patientObject.GetNumberOfComponentsPerPixel()))
print('The real world size of these pixels (i.e. voxel size) is {} mm H x {} mm W x {} mm D [slice thickness]'.
     format(patientObject.GetSpacing()[0], patientObject.GetSpacing()[1], patientObject.GetSpacing()[2]))
print('The real world origin for these pixels is {} mm. This helps us align different studies.'.format(patientObject.GetOrigin()))


This is a 3 dimensional image
There are 265 slices for this patient.
Each slice is 512 H x 512 W pixels.
Color depth of the image is 1.
The real world size of these pixels (i.e. voxel size) is 0.582031 mm H x 0.582031 mm W x 1.25 mm D [slice thickness]
The real world origin for these pixels is (-139.10000610351562, -135.5, -337.43499755859375) mm. This helps us align different studies.

Helper function for displaying images

This is just a helper function to properly display our figures in matplotlib. I modified it from this website: https://pyscience.wordpress.com/2014/10/19/image-segmentation-with-python-and-simpleitk/


In [6]:
def sitk_show(img, title=None, margin=0.05, dpi=40):
    
    # Here we are just converting the image from sitk to a numpy ndarray
    ndArray = sitk.GetArrayFromImage(img)
    
    # Conversion note:
    # SimpleITK stores things as (x, y, z). numpy ndarray is (z, y, x). So be careful!
    
    spacing = img.GetSpacing() # This returns the realworld size of the pixels (in mm)
    figsize = (1 + margin) * ndArray.shape[0] / dpi, (1 + margin) * ndArray.shape[1] / dpi
    extent = (0, ndArray.shape[1]*spacing[1], ndArray.shape[0]*spacing[0], 0)
    fig = plt.figure(figsize=figsize, dpi=dpi)
    ax = fig.add_axes([margin, margin, 1 - 2*margin, 1 - 2*margin])

    plt.set_cmap(cm.bone)  # TODO: This might be specific to CT scans(?)
    ax.imshow(ndArray, extent=extent, interpolation='none')  # 'none' turns off interpolation so we see the raw pixels
    
    if title:
        plt.title(title)
    
    plt.show()

In [7]:
sitk_show(patientObject[:,:,0], title='Slice#0', dpi=80)  # Display slice 0
sitk_show(patientObject[:,:,100], title='Slice #100', dpi=80) # Display slice 100
sitk_show(patientObject[:,:,250], title='Slice #250', dpi=80) # Display slice 250


Now load the entire set of DICOM images into one large HDF5 file

Alternatively, we could do some other pre-processing of the images (e.g. normalizing them, aligning them, segmenting them) and then save to HDF5.

However, for right now let's just load the data without any other pre-processing.
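
If we do later want to normalize, a simple option for CT is to clip the intensities to a window and rescale to [0, 1]. The sketch below assumes the reader has already applied the DICOM rescale slope/intercept (SimpleITK's series reader generally does this), and the window values are just placeholders, not something we've settled on.

In [ ]:
def normalize_volume(ndArray, minVal=-1000.0, maxVal=400.0):
    """Sketch: clip intensities to [minVal, maxVal] and rescale to the range [0, 1]."""
    ndArray = np.clip(ndArray.astype(np.float32), minVal, maxVal)
    return (ndArray - minVal) / (maxVal - minVal)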


In [8]:
import h5py
import ntpath

The 'input' dataset expects the array to be flattened!

The HDF5Iterator expects to get a 2D array where the rows are each sample and the columns are each feature. So for a CxHxW set of images, the number of features should be the product of those 3 dimensions.

I need to add a depth dimension (D) for the slices. So I'll have a 2D array that is # samples x (CxHxWxD).


In [9]:
def getImageTensor(patientDirectory):
    """
    Helper function for ingesting all of the DICOM files for one study into a single tensor
    
    input: 'patientDirectory', the directory where the DICOMs for a single patient are stored
    outputs:
            imgTensor = a flattened numpy array (1, C*H*W*D)
            C = number of channels per pixel (1 for MR, CT, and Xray)
            H = number of pixels in height
            W = number of pixels in width
            D = number of pixels in depth
    """
    reader = sitk.ImageSeriesReader()  # Set up the reader object
    
    # Now get the names of the DICOM files within the directory
    filenamesDICOM = reader.GetGDCMSeriesFileNames(patientDirectory)
    reader.SetFileNames(filenamesDICOM)
    # Now execute the reader pipeline
    patientObject = reader.Execute()

    C = patientObject.GetNumberOfComponentsPerPixel() # There is just one color channel in the DICOM for CT and MRI
    H = patientObject.GetHeight()  # Height in pixels
    W = patientObject.GetWidth()   # Width in pixels
    #D = patientObject.GetDepth()  # Depth in pixels
    D = 128   # Let's limit the depth to 128 slices for now
        
    # We need to transpose the SimpleITK ndarray to the right order for neon
    # Then we need to flatten the array to a single vector (1, C*H*W*D)
    imgTensor = sitk.GetArrayFromImage(patientObject[:,:,:D]).transpose([1, 2, 0]).ravel().reshape(1,-1)
            
    return imgTensor, C, H, W, D

Loop through the patient directory and load the DICOM tensors into HDF5 file

HDF5 allows for the dataset to be dynamically updated. So this should iteratively append new DICOM tensors to the HDF5 file. Otherwise, we'd have to load all of the files into memory and quickly run out of space.

TODO (priority low): I think there's a parallel way of writing to HDF5. I might be able to speed things up by having parallel threads to load different patients and append them to the HDF5.


In [10]:
outFilename = 'dicom_out.h5'  # The name of our HDF5 data file

with h5py.File(outFilename, 'w') as df:  # Open hdf5 file for writing our DICOM dataset

    numPatients = len(patients)  # Number of patients in the directory

    patientDirectory = patients[0]  # Use the first patient to set up the HDF5 dataset

    patientID = ntpath.basename(patientDirectory) # Unique ID for patient

    print('({} of {}): Processing patient: {}'.format(1, numPatients, patientID))

    imgTensor, original_C, original_H, original_W, original_D = getImageTensor(patientDirectory)

    # Create a resizable dataset so that we can append one row (i.e. one patient) at a time
    dset = df.create_dataset('input', data=imgTensor, maxshape=[None, original_C*original_H*original_W*original_D])

    # Now iterate through the remaining patients and append their image tensors to the HDF5 dataset
    for i, patientDirectory in enumerate(patients[1:]):

        print('({} of {}): Processing patient: {}'.format(i+2, numPatients, ntpath.basename(patientDirectory)))

        imgTensor, C, H, W, D = getImageTensor(patientDirectory)

        # Sanity check
        # Make sure all dimensions match the first patient. Otherwise, we'd need to pad (?)
        assert(C == original_C)
        assert(H == original_H)
        assert(W == original_W)
        assert(D == original_D)

        # HDF5 allows us to dynamically resize the dataset
        row = dset.shape[0]  # How many rows are in the dataset currently?
        dset.resize(row+1, axis=0)   # Add one more row (i.e. new patient)
        dset[row, :] = imgTensor  # Append the new row to the dataset

    # Output attribute 'lshape' to let neon know the shape of the tensor.
    # Use the dimensions from the first patient since the asserts above guarantee they all match.
    df['input'].attrs['lshape'] = (original_C, original_H, original_W, original_D) # (Channel, Height, Width, Depth)

    print('FINISHED. Output to HDF5 file: {}'.format(outFilename))


(1 of 5): Processing patient: 0030a160d58723ff36d73f41b170ec21
(2 of 5): Processing patient: 0092c13f9e00a3717fdc940641f00015
(3 of 5): Processing patient: 00cba091fa4ad62cc3200a657aeb957e
(4 of 5): Processing patient: 00edff4f51a893d80dae2d42a7f45ad1
(5 of 5): Processing patient: 013395589c01aa01f8df81d80fb0e2b8
FINISHED. Output to HDF5 file: dicom_out.h5

Questions:

  • I've confirmed with Evren that neon wants a fixed size tensor. However, the number of slices varies from patient to patient. For the first pass, I think I am just going to define the depth to be the largest number of slices in the dataset and zero-pad studies that have fewer slices (see the padding sketch after these questions).
  • Do I need to perform registration (i.e. alignment so that the spatial positions are the same between patients)?
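
Here's a small sketch of that zero-padding idea: pad (or truncate) the depth axis of an (H, W, D) volume to a fixed number of slices before flattening. maxDepth is a placeholder for whatever the largest slice count in the dataset turns out to be.

In [ ]:
def pad_depth(volume, maxDepth):
    """Sketch: zero-pad (or truncate) the depth axis of an (H, W, D) array to maxDepth slices."""
    depth = volume.shape[2]
    if depth >= maxDepth:
        return volume[:, :, :maxDepth]  # Truncate if there are more slices than maxDepth
    padWidth = ((0, 0), (0, 0), (0, maxDepth - depth))  # Pad only at the end of the depth axis
    return np.pad(volume, padWidth, mode='constant', constant_values=0)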

Does the new HDF5 load correctly using neon's HDF5Iterator?

Yes, it looks like it. Still need more sanity checking to make sure that the backend is handling the additional depth dimension correctly.


In [11]:
from neon.data import HDF5Iterator  # Neon's HDF5 data loader
from neon.backends import gen_backend

In [12]:
be = gen_backend(backend='cpu', batch_size=1)

In [13]:
train_set = HDF5Iterator(outFilename)

In [14]:
train_set.get_description()


Out[14]:
{'config': {'hdf_filename': 'dicom_out.h5', 'name': 'HDF5Iterator_0'},
 'type': 'neon.data.hdf5iterator.HDF5Iterator'}

In [15]:
plt.figure(figsize=[10,10])
plt.title('Slice #100')
plt.imshow(train_set.inp[0,:].reshape(512,512,128)[:,:,100]);  # Show the slice 100 of patient 0


HDF5 loader seems to correctly import the images from the HDF5 file

Hopefully this means it will load into the model as a 4D tensor (CxHxWxD).

The HDF5 data loader seems to just unravel the entire tensor into a vector. So it's a 2D array where the rows are the different patients and the columns are the CxHxWxD tensor unraveled as a vector.
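
As a quick check on that, we can read the 'lshape' attribute back out of the HDF5 file with h5py and use it to un-flatten one row into the (C, H, W, D) tensor. This is just an illustrative sanity check against the file we wrote above, not part of the neon loader itself.

In [ ]:
with h5py.File(outFilename, 'r') as df:
    lshape = tuple(df['input'].attrs['lshape'])  # (C, H, W, D) stored when we wrote the file
    volume = df['input'][0, :].reshape(lshape)   # Un-flatten patient 0 back into a 4D tensor
    print('Recovered tensor shape: {}'.format(volume.shape))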


In [16]:
train_set.cleanup()  # Need to close the HDF5 file when we are finished

FUTURE

DICOM Alignment

Here's the document explaining the metadata fields for DICOM images.

ftp page

http://nipy.org/nibabel/dicom/dicom_orientation.html
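
When we get to alignment, one starting point is the direction cosine matrix that SimpleITK exposes for the loaded series; it carries the same orientation information that the document above describes (ImageOrientationPatient), along with the origin and spacing we printed earlier. Just a sketch of inspecting it, assuming patientObject from In [4] is still in memory:

In [ ]:
direction = patientObject.GetDirection()  # 9 direction cosines (row-major 3x3) for a 3D image
print('Direction cosine matrix:\n{}'.format(np.array(direction).reshape(3, 3)))
print('Origin (mm): {}'.format(patientObject.GetOrigin()))
print('Voxel spacing (mm): {}'.format(patientObject.GetSpacing()))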


In [ ]: